audio <- fread("audiofeatures_clean.csv")
artists <- fread("artist_df.csv")[,-1]
data <- merge(artists, audio, by.x = 'ID', by.y = 'artist_uri')
data <- na.omit(data) #remove NAs (there were fewer than 5 such rows)
data <- data %>%
mutate(Generation = floor(Generation),
duration_mmss = format(as.POSIXct(Sys.Date()) + duration_ms/1000, "%M:%S"),
duration_ms = duration_ms/60000) %>%
select(-artist) %>%
rename(Artist_uri = ID,
ArtistType = Type,
ArtistGender = Gender,
ArtistGen = Generation,
ArtistDebut = DebutYear,
duration = duration_ms)
#albums <- unique(select(data, Artist, album))
# head(data, n=5)
There are intro and outro songs on albums that serve as aesthetic tracks, tying the entire album together as one artistic work. However, these tracks do not help characterize the kpop genre: they act as 'filler' rather than as candidates to be actively promoted in the commercial music market. We would therefore like to remove them from the dataset, as they are the source of skewness toward the high end in instrumentalness and speechiness and toward the low end in song duration.
Specifically, we remove songs whose names contain the words intro, outro, or interlude; songs that are 10 minutes or longer; and songs of 2 minutes or less that are outliers in instrumentalness or speechiness.
This does not remove every single 'filler' song, but it removes most of them.
remove <- data %>%
  filter(str_detect(song_name, "intro|outro|interlude") |
           duration >= 10 |
           (duration <= 2 & (instrumentalness >= 0.50 | speechiness >= 0.40 | speechiness == 0)))
nrow(remove)
## [1] 330
data <- data[!data$song_uri %in% remove$song_uri, ]
#fwrite(data, "kpopdata.csv")
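The histograms below are drawn with a `gghist()` helper that is not shown in this section and is assumed to be defined earlier in the document. A minimal sketch consistent with the calls and the `stat_bin()` messages below might be:

```r
library(ggplot2)

# Hypothetical reconstruction of the gghist() helper used throughout:
# a plain themed histogram of a numeric vector with a custom x-axis label.
gghist <- function(x, label) {
  ggplot(data.frame(x = x), aes(x)) +
    geom_histogram(fill = 'white', col = 'black') +  # default bins = 30
    xlab(label) +
    theme_minimal()
}
```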
Let's look at a summary of the cleaned data.
summary(data)
## Artist_uri Artist ArtistType ArtistGender
## Length:12062 Length:12062 Length:12062 Length:12062
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## ArtistGen ArtistDebut song_name song_uri
## Min. :1.000 Min. :1992 Length:12062 Length:12062
## 1st Qu.:2.000 1st Qu.:2007 Class :character Class :character
## Median :2.000 Median :2010 Mode :character Mode :character
## Mean :2.261 Mean :2009
## 3rd Qu.:3.000 3rd Qu.:2015
## Max. :4.000 Max. :2020
## album album_uri release_date popularity
## Length:12062 Length:12062 Length:12062 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.:11.00
## Mode :character Mode :character Mode :character Median :23.00
## Mean :25.82
## 3rd Qu.:39.00
## Max. :94.00
## duration acousticness danceability energy
## Min. :0.3305 Min. :0.0000038 Min. :0.0733 Min. :0.000923
## 1st Qu.:3.3059 1st Qu.:0.0331250 1st Qu.:0.5910 1st Qu.:0.681000
## Median :3.5666 Median :0.1100000 Median :0.6740 Median :0.813000
## Mean :3.6188 Mean :0.1960758 Mean :0.6581 Mean :0.769385
## 3rd Qu.:3.8985 3rd Qu.:0.2837500 3rd Qu.:0.7400 3rd Qu.:0.896000
## Max. :9.2587 Max. :0.9900000 Max. :0.9770 Max. :0.999000
## instrumentalness key liveness loudness
## Min. :0.0000000 Min. : 0.000 Min. :0.0116 Min. :-27.040
## 1st Qu.:0.0000000 1st Qu.: 2.000 1st Qu.:0.0932 1st Qu.: -5.473
## Median :0.0000000 Median : 5.000 Median :0.1380 Median : -4.202
## Mean :0.0081971 Mean : 5.221 Mean :0.1928 Mean : -4.537
## 3rd Qu.:0.0000014 3rd Qu.: 8.000 3rd Qu.:0.2750 3rd Qu.: -3.167
## Max. :0.9510000 Max. :11.000 Max. :0.9830 Max. : 0.394
## mode speechiness tempo time_signature
## Min. :0.0000 Min. :0.02200 Min. : 47.6 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.03860 1st Qu.:102.0 1st Qu.:4.000
## Median :1.0000 Median :0.05510 Median :121.9 Median :4.000
## Mean :0.6111 Mean :0.07749 Mean :121.5 Mean :3.977
## 3rd Qu.:1.0000 3rd Qu.:0.09140 3rd Qu.:135.0 3rd Qu.:4.000
## Max. :1.0000 Max. :0.95700 Max. :218.1 Max. :5.000
## valence duration_mmss
## Min. :0.0349 Length:12062
## 1st Qu.:0.4350 Class :character
## Median :0.6160 Mode :character
## Mean :0.5962
## 3rd Qu.:0.7680
## Max. :0.9840
release date distribution
#gghist(data$release_date, 'release date')
The popularity measure from Spotify is defined as:

> The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity. Note that the popularity value may lag actual popularity by a few days: the value is not updated in real time.
popularity distribution
gghist(data$popularity, 'popularity')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The mode for popularity of songs in the entire dataset is around 10, and the majority of songs have a popularity score below 50. In other words, very few songs in the dataset have very high popularity scores. This is appropriate, since Spotify's popularity score incorporates how recent the plays are (the more recent, the more popular). The vast majority of songs in the dataset are older, released prior to 2018, so the share of data that would be considered 'recent' is much smaller. Therefore, the distribution we are seeing is to be expected.
popularity by generation
ggplot(data, aes(popularity, group = ArtistGen)) +
geom_histogram(aes(y = stat(density)),
fill = 'white', col = 'black') +
xlab('popularity') +
theme_minimal() +
facet_wrap(~ArtistGen)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Again, due to the time sensitivity of the popularity score, we can see that the center of the data for each generation moves higher as the generation increases (toward the newest generation of kpop).
ggplot(data = data, aes(popularity)) +
geom_boxplot(fill = 'white', col = 'black', outlier.color = 'red') +
xlab('Popularity') +
theme_minimal()
by generation
ggplot(data = data, aes(x = popularity, group = ArtistGen)) +
geom_boxplot(fill = 'white', col = 'black',
outlier.color = 'red') +
xlab('Popularity') +
theme_minimal()+
facet_wrap(~ArtistGen)
Duration is the track length measured by Spotify in milliseconds. However, I have converted the measurement to minutes (duration in milliseconds / 60000) to make the analysis more interpretable. Note that this decimal-minutes value is not the same as the MM:SS format, but it is a close approximation.
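As a quick illustration with a hypothetical 225,000 ms track (the number is chosen for the example, not taken from the data):

```r
ms <- 225000   # hypothetical track length in milliseconds
ms / 60000     # decimal minutes used in the analysis: 3.75

# The exact MM:SS rendering of the same track:
sprintf("%02d:%02d", ms %/% 60000, (ms %% 60000) %/% 1000)  # "03:45"
```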
duration distribution
gghist(data$duration, 'duration (minutes)')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As you can see, the majority of tracks are between 2.5 and 5 minutes long. This is typical, as most pop songs run 2-5 minutes. Upon investigation, two significantly long songs have been removed:
* Turbo's non-stop summer dj remix, which is 22 minutes long; likely a track meant to be played at clubs or party events.
* Orange Caramel's magic - origin, at 13 minutes.
The rest of the songs are 9 minutes or less.
duration by generation
ggplot(data, aes(duration, group = ArtistGen)) +
geom_histogram(aes(y = stat(density)),
fill = 'white', col = 'black') +
xlab('duration (minutes)') +
theme_minimal() +
facet_wrap(~ArtistGen)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Overall, the distributions look fairly similar. However, first-generation songs tend to be longer, with more tracks at 4 minutes and above.
ggplot(data = data, aes(duration)) +
geom_boxplot(fill = 'white', col = 'black', outlier.color = 'red') +
xlab('Duration (minutes)') +
theme_minimal()
quantile(data$duration)
## 0% 25% 50% 75% 100%
## 0.3305333 3.3058667 3.5666250 3.8985458 9.2586667
by generation
ggplot(data = data, aes(x = duration, group = ArtistGen)) +
geom_boxplot(fill = 'white', col = 'black',
outlier.color = 'red') +
xlab('Duration (minutes)') +
theme_minimal()+
facet_wrap(~ArtistGen)
A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
acousticness distribution
gghist(data$acousticness, 'acousticness')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gghist(sqrt(data$acousticness), 'acousticness square root')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Overall, kpop music is heavily influenced by styles like EDM, hip hop, electronic, and tropical house [find a resource to make a conclusive statement about this?]. The backing tracks for the singers therefore use a lot of electronic sounds and instrumentation, so it is appropriate for this dataset that the acousticness feature is skewed to the right, with the majority of tracks having acousticness levels below 0.25.
acousticness by generation
#calling density but converting back to relative frequency
ggplot(data, aes(acousticness, group = ArtistGen)) +
geom_histogram(bins = 30, aes(y = stat(width*density)),
fill = 'white', col = 'black') +
xlab('acousticness') +
scale_y_continuous(labels = percent_format())+
facet_wrap(~ArtistGen) +
theme_minimal()
ggplot(data, aes(x = acousticness, y = stat(density), group = ArtistGen)) +
geom_histogram(bins = 30, fill = 'white', col = 'black') +
xlab('acousticness') +
theme_minimal() +
facet_wrap(~ArtistGen) +
geom_density(col = 'red')
ggplot(data = data, aes(acousticness)) +
geom_boxplot(fill = 'white', col = 'black', outlier.color = 'red') +
xlab('Acousticness') +
theme_minimal()
by generation
ggplot(data = data, aes(x = acousticness, group = ArtistGen)) +
geom_boxplot(fill = 'white', col = 'black',
outlier.color = 'red') +
xlab('Acousticness') +
theme_minimal()+
facet_wrap(~ArtistGen)
Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
danceability distribution
gghist(data$danceability, 'danceability')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Kpop is well known for its focus on eye-catching choreography and dance to accompany the song [provide some resources describing this?]. It is therefore understandable that the majority of tracks sit above 0.50 on danceability, with low-frequency tails below. The distribution looks like a negatively skewed normal.
Danceability by generation
ggplot(data, aes(danceability, group = ArtistGen)) +
geom_histogram(bins = 30, aes(y = stat(width*density)),
fill = 'white', col = 'black') +
xlab('danceability') + ylab("relative frequency") +
scale_y_continuous(labels = percent_format())+
facet_wrap(~ArtistGen)+
theme_minimal()
ggplot(data = data, aes(danceability)) +
geom_boxplot(fill = 'white', col = 'black', outlier.color = 'red') +
xlab('Danceability') +
theme_minimal()
by generation
ggplot(data = data, aes(x = danceability, group = ArtistGen)) +
geom_boxplot(fill = 'white', col = 'black',
outlier.color = 'red') +
xlab('Danceability') +
theme_minimal()+
facet_wrap(~ArtistGen)
Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
energy distribution
gghist(data$energy, 'energy')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
energy by generation
ggplot(data, aes(energy, group = ArtistGen)) +
geom_histogram(bins = 30, aes(y = stat(width*density)),
fill = 'white', col = 'black') +
xlab('energy') + ylab("relative frequency") +
scale_y_continuous(labels = percent_format()) +
theme_minimal() +
facet_wrap(~ArtistGen)
ggplot(data = data, aes(energy)) +
geom_boxplot(fill = 'white', col = 'black', outlier.color = 'red') +
xlab('Energy') +
theme_minimal()
by generation
ggplot(data = data, aes(x = energy, group = ArtistGen)) +
geom_boxplot(fill = 'white', col = 'black',
outlier.color = 'red') +
xlab('Energy') +
theme_minimal()+
facet_wrap(~ArtistGen)
Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
instrumentalness distribution
gghist(data$instrumentalness, 'instrumentalness')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gghist(log(data$instrumentalness), 'instrumentalness (log transform)')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8850 rows containing non-finite values (stat_bin).
instrumentalness by generation
ggplot(data, aes(instrumentalness, group = ArtistGen)) +
geom_histogram(bins = 30, aes(y = stat(width*density)),
fill = 'white', col = 'black') +
xlab('instrumentalness') + ylab('relative frequency') +
scale_y_continuous(labels=percent_format()) +
theme_minimal() +
facet_wrap(~ArtistGen)
ggplot(data, aes(log(instrumentalness), group = ArtistGen)) +
geom_histogram(bins = 30, aes(y = stat(width*density)),
fill = 'white', col = 'black') +
xlab('instrumentalness (log transform)') + ylab('relative frequency') +
scale_y_continuous(labels=percent_format()) +
theme_minimal() +
facet_wrap(~ArtistGen)
## Warning: Removed 8850 rows containing non-finite values (stat_bin).
ggplot(data = data, aes(log(instrumentalness))) +
geom_boxplot(fill = 'white', col = 'black', outlier.color = 'red') +
xlab('Instrumentalness (log)') +
theme_minimal()
## Warning: Removed 8850 rows containing non-finite values (stat_boxplot).
by generation
ggplot(data = data, aes(x = log(instrumentalness), group = ArtistGen)) +
geom_boxplot(fill = 'white', col = 'black',
outlier.color = 'red') +
xlab('Instrumentalness (Log)') +
theme_minimal()+
facet_wrap(~ArtistGen)
## Warning: Removed 8850 rows containing non-finite values (stat_boxplot).
The key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
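For readability, the integer codes can be labeled with their pitch names (a convenience sketch, not part of the original pipeline):

```r
# Pitch Class notation: 0 = C, 1 = C#/Db, ..., 11 = B
pitch_classes <- c('C', 'C#/Db', 'D', 'D#/Eb', 'E', 'F',
                   'F#/Gb', 'G', 'G#/Ab', 'A', 'A#/Bb', 'B')

# key is 0-indexed while R vectors are 1-indexed, hence the + 1
table(pitch_classes[data$key + 1])
```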
key distribution (categorical, 0-11)
gghist(data$key, 'musical key')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Surprise, surprise: the majority of songs use the key of C.
Key by Generation
ggplot(data, aes(key, group = ArtistGen)) +
geom_histogram(aes(y = stat(density)),
fill = 'white', col = 'black') +
xlab('musical key') +
theme_minimal() +
facet_wrap(~ArtistGen)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
During the data collection and cleaning process, I intentionally omitted live performance recordings from the dataset, because they are simply duplicates of the originally released commercial songs.
Any song flagged as live here is therefore an artifact of Spotify's detection algorithm, not an actual live performance. This variable will be removed from the analysis.
liveness distribution
gghist(data$liveness, 'liveness')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(liveness, group = ArtistGen)) +
geom_histogram(aes(y = stat(density)),
fill = 'white', col = 'black') +
xlab('liveness') +
theme_minimal() +
facet_wrap(~ArtistGen)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = data, aes(liveness)) +
geom_boxplot(fill = 'white', col = 'black', outlier.color = 'red') +
xlab('Liveness') +
theme_minimal()
by generation
ggplot(data = data, aes(x = liveness, group = ArtistGen)) +
geom_boxplot(fill = 'white', col = 'black',
outlier.color = 'red') +
xlab('Liveness') +
theme_minimal()+
facet_wrap(~ArtistGen)
The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
loudness distribution
gghist(data$loudness, 'loudness')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(loudness, group = ArtistGen)) +
geom_histogram(aes(y = stat(density)),
fill = 'white', col = 'black') +
xlab('loudness') +
theme_minimal() +
facet_wrap(~ArtistGen)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = data, aes(loudness)) +
geom_boxplot(fill = 'white', col = 'black', outlier.color = 'red') +
xlab('Loudness') +
theme_minimal()
by generation
ggplot(data = data, aes(x = loudness, group = ArtistGen)) +
geom_boxplot(fill = 'white', col = 'black',
outlier.color = 'red') +
xlab('Loudness') +
theme_minimal()+
facet_wrap(~ArtistGen)
Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
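Since mode is a 0/1 indicator, its mean gives the share of major-key tracks directly; the summary above puts this at roughly 61%:

```r
mean(data$mode)  # ~0.611, i.e. about 61% of tracks are in a major key
```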
mode distribution (also categorical)
gghist(data$mode, 'musical mode')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(mode, group = ArtistGen)) +
geom_histogram(aes(y = stat(density)),
fill = 'white', col = 'black') +
xlab('musical mode') +
theme_minimal() +
facet_wrap(~ArtistGen)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
speechiness distribution
gghist(data$speechiness, 'speechiness')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gghist(log(data$speechiness), 'speechiness (log transform)')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Because this dataset contains only songs, with no podcasts or book readings, we would expect this heavily skewed shape, where the majority of the data is under 0.25 for speechiness. However, it is very difficult to see the details of the long right tail, so we will also investigate the distribution on the log scale. On the log scale the data is still strongly skewed to the right, but we can see more detail in the right tail for speechiness values above 0.25 (original scale). There is a gradual decline in songs with high levels of speechiness, with a clear drop-off for log values from -1 to 0. These are likely oddities of the data; perhaps some albums include spoken tracks between songs, serving a similar artistic purpose to instrumental intro/outro tracks.
ggplot(data, aes(speechiness, group = ArtistGen)) +
geom_histogram(aes(y = stat(density)),
fill = 'white', col = 'black') +
xlab('speechiness') +
theme_minimal() +
facet_wrap(~ArtistGen)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(log(speechiness), group = ArtistGen)) +
geom_histogram(aes(y = stat(density)),
fill = 'white', col = 'black') +
xlab('speechiness (log transform)') +
theme_minimal() +
facet_wrap(~ArtistGen)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now we look at the histograms by kpop generation. All are heavily right skewed, just like the overall distribution, but there are some differences between generations. For example, generation 2 has the least speechiness among its tracks, with its tallest bar at speechiness levels at or close to zero. The 4th generation has a higher center of speechiness than the rest; perhaps more rap is incorporated into the music released in the 4th generation of kpop than in other eras.
Beyond these observations, however, it is difficult to compare the distributions at higher levels of speechiness, so we turn to the log scale. For the 4th generation, not only is the center of the distribution higher, but the upper tail is shorter than in the 1st and 2nd generations, while maintaining a higher concentration between -2 and -1 on the log scale. Generation 2 also has a shorter right tail (ending around log value -1), and its density for speechiness levels from about -2.5 to -1 drops consistently and appears lower than in the other generations.
Generation 1 seems to have the most variation in speechiness, with a center similar to generation 3 but greater variation in density for log-scale speechiness levels of -2 to 0; it also reaches the highest speechiness levels. This variation could be attributed to the experimental nature of first-generation music: this was when kpop was starting to become the model and style of what we listen to in the modern era, while many music companies and artists were still trying to define their sound and fit the demands of the market. The generally high levels of speechiness could be due to rap and hip-hop elements being heavily incorporated into the kpop genre at the time, which carry heavier speechiness than generic pop.
ggplot(data = data, aes(log(speechiness))) +
geom_boxplot(fill = 'white', col = 'black', outlier.color = 'red') +
xlab('Speechiness (log)') +
theme_minimal()
by generation
ggplot(data = data, aes(x = log(speechiness), group = ArtistGen)) +
geom_boxplot(fill = 'white', col = 'black',
outlier.color = 'red') +
xlab('Speechiness (log)') +
theme_minimal()+
facet_wrap(~ArtistGen)
The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
tempo distribution
gghist(data$tempo, 'tempo')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As expected, most songs have a tempo between 90 and 160 BPM, covering roughly moderato through allegro paces.
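As a rough illustration, conventional tempo markings can be attached with `cut()`; the breakpoints below are approximate conventions and an assumption for this sketch, not part of the original analysis:

```r
# Approximate BPM ranges for conventional tempo markings (assumed cutoffs)
tempo_marking <- cut(data$tempo,
                     breaks = c(0, 76, 108, 120, 156, 176, Inf),
                     labels = c('adagio or slower', 'andante', 'moderato',
                                'allegro', 'vivace', 'presto'))
table(tempo_marking)
```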
ggplot(data, aes(tempo, group = ArtistGen)) +
geom_histogram(aes(y = stat(density)),
fill = 'white', col = 'black') +
xlab('tempo') +
theme_minimal() +
facet_wrap(~ArtistGen)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = data, aes(tempo)) +
geom_boxplot(fill = 'white', col = 'black', outlier.color = 'red') +
xlab('Tempo') +
theme_minimal()
by generation
ggplot(data = data, aes(x = tempo, group = ArtistGen)) +
geom_boxplot(fill = 'white', col = 'black',
outlier.color = 'red') +
xlab('Tempo') +
theme_minimal()+
facet_wrap(~ArtistGen)
An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
time_signature distribution (categorical...)
gghist(data$time_signature, 'time signature')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(time_signature, group = ArtistGen)) +
geom_histogram(aes(y = stat(density)),
fill = 'white', col = 'black') +
xlab('time signature') +
theme_minimal() +
facet_wrap(~ArtistGen)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
valence distribution
gghist(data$valence, 'valence')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(valence, group = ArtistGen)) +
geom_histogram(aes(y = stat(density)),
fill = 'white', col = 'black') +
xlab('valence') +
theme_minimal() +
facet_wrap(~ArtistGen)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = data, aes(valence)) +
geom_boxplot(fill = 'white', col = 'black', outlier.color = 'red') +
xlab('Valence') +
theme_minimal()
by generation
ggplot(data = data, aes(x = valence, group = ArtistGen)) +
geom_boxplot(fill = 'white', col = 'black',
outlier.color = 'red') +
xlab('Valence') +
theme_minimal()+
facet_wrap(~ArtistGen)
numdata <- select(data, ArtistGen, ArtistDebut, popularity, duration,
                  acousticness, danceability, energy, instrumentalness,
                  key, liveness, loudness, mode, speechiness, tempo,
                  time_signature, valence)
cor(numdata)
## ArtistGen ArtistDebut popularity duration
## ArtistGen 1.000000000 0.906463428 0.572507933 -0.193551800
## ArtistDebut 0.906463428 1.000000000 0.566633425 -0.222470200
## popularity 0.572507933 0.566633425 1.000000000 -0.146654469
## duration -0.193551800 -0.222470200 -0.146654469 1.000000000
## acousticness 0.013167726 0.048776911 -0.010959230 0.106546635
## danceability -0.035442789 -0.053285013 0.002225069 -0.207410267
## energy 0.063873042 0.044943677 0.029427259 -0.151948909
## instrumentalness -0.017178344 -0.022263366 -0.103616606 -0.065392822
## key -0.014250182 -0.015387239 0.010163051 -0.019359159
## liveness 0.023913390 0.015616327 0.004411808 -0.073202424
## loudness 0.237058982 0.268033589 0.209447173 -0.087723119
## mode -0.013457062 -0.004185209 -0.047089636 0.085317965
## speechiness 0.062099487 0.042442715 0.105374038 -0.170553885
## tempo 0.032818633 0.023702499 0.010452936 -0.008087927
## time_signature 0.004728772 0.005876250 -0.007400160 0.030012350
## valence -0.067554295 -0.071581185 -0.054143720 -0.209173214
## acousticness danceability energy instrumentalness
## ArtistGen 0.01316773 -0.035442789 0.06387304 -0.017178344
## ArtistDebut 0.04877691 -0.053285013 0.04494368 -0.022263366
## popularity -0.01095923 0.002225069 0.02942726 -0.103616606
## duration 0.10654664 -0.207410267 -0.15194891 -0.065392822
## acousticness 1.00000000 -0.308316310 -0.65856905 0.015332382
## danceability -0.30831631 1.000000000 0.25229933 -0.052982777
## energy -0.65856905 0.252299330 1.00000000 -0.074246290
## instrumentalness 0.01533238 -0.052982777 -0.07424629 1.000000000
## key -0.01058317 0.024133395 0.01142881 -0.002945427
## liveness -0.08869619 -0.049754947 0.17836243 -0.014914857
## loudness -0.41056995 0.150430522 0.70962088 -0.191925850
## mode 0.16462169 -0.125608880 -0.16613186 0.001355848
## speechiness -0.14024612 0.043525005 0.19759760 -0.043589900
## tempo -0.08788863 -0.207198178 0.13761768 -0.007344857
## time_signature -0.14759122 0.131205123 0.16255464 -0.032769493
## valence -0.34318875 0.520565826 0.48114552 -0.058604738
## key liveness loudness mode
## ArtistGen -0.0142501822 0.0239133904 0.237058982 -0.013457062
## ArtistDebut -0.0153872391 0.0156163270 0.268033589 -0.004185209
## popularity 0.0101630513 0.0044118082 0.209447173 -0.047089636
## duration -0.0193591591 -0.0732024241 -0.087723119 0.085317965
## acousticness -0.0105831686 -0.0886961850 -0.410569949 0.164621692
## danceability 0.0241333947 -0.0497549467 0.150430522 -0.125608880
## energy 0.0114288128 0.1783624263 0.709620884 -0.166131860
## instrumentalness -0.0029454267 -0.0149148575 -0.191925850 0.001355848
## key 1.0000000000 -0.0001758457 0.003934826 -0.187091194
## liveness -0.0001758457 1.0000000000 0.089253003 -0.041983169
## loudness 0.0039348264 0.0892530029 1.000000000 -0.088897818
## mode -0.1870911943 -0.0419831687 -0.088897818 1.000000000
## speechiness 0.0273822199 0.0956207722 0.045248704 -0.097107694
## tempo -0.0071509873 0.0242618772 0.115799160 0.005119411
## time_signature 0.0048245791 0.0116518797 0.151265070 -0.027722047
## valence 0.0188664411 0.0457798082 0.319998708 -0.138146536
## speechiness tempo time_signature valence
## ArtistGen 0.06209949 0.032818633 0.004728772 -0.06755430
## ArtistDebut 0.04244271 0.023702499 0.005876250 -0.07158118
## popularity 0.10537404 0.010452936 -0.007400160 -0.05414372
## duration -0.17055388 -0.008087927 0.030012350 -0.20917321
## acousticness -0.14024612 -0.087888626 -0.147591224 -0.34318875
## danceability 0.04352500 -0.207198178 0.131205123 0.52056583
## energy 0.19759760 0.137617678 0.162554641 0.48114552
## instrumentalness -0.04358990 -0.007344857 -0.032769493 -0.05860474
## key 0.02738222 -0.007150987 0.004824579 0.01886644
## liveness 0.09562077 0.024261877 0.011651880 0.04577981
## loudness 0.04524870 0.115799160 0.151265070 0.31999871
## mode -0.09710769 0.005119411 -0.027722047 -0.13814654
## speechiness 1.00000000 0.127195540 0.013564347 0.13065639
## tempo 0.12719554 1.000000000 -0.041415907 0.03904957
## time_signature 0.01356435 -0.041415907 1.000000000 0.09944025
## valence 0.13065639 0.039049566 0.099440246 1.00000000
Prior to any data transformations, the only highly correlated pair of variables is ArtistDebut and ArtistGen. This multicollinearity will not affect our analysis, since we will not be investigating the relationship between those two variables extensively. The observation is reasonable and should be expected, since the concept of kpop generations is partly defined by when an artist debuted into the kpop market and the period in which they promoted their music.
The next highest correlation is between energy and loudness, at 0.710. This suggests a moderately strong positive association: as the detected energy level increases, the loudness of the music also increases, and vice versa. The third strongest correlation, -0.659, is between energy and acousticness. This moderate negative association indicates that an increase in a song's energy corresponds with a decrease in its acousticness, and vice versa.
With moderate positive associations we can observe the following relationships:
Popularity and Artist Generation: 0.573. Since the Spotify algorithm bases popularity on total plays and how recent those plays are, some association between the score and the time when a song was actively promoted is to be expected. This positive association means that higher popularity goes with later generations (gen 4 rather than gen 1). What is surprising is that the association between popularity and a song's placement within a kpop generation is not stronger; this shows that many of the older songs are still actively listened to today.
Danceability and valence: 0.521. This positive relationship can be interpreted as: with higher danceability, a song is expected to be higher in valence (which measures happier versus sadder moods). This relationship is reasonable, since one would naturally be more drawn to dance to a happier song.
Energy and valence: 0.481.
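The pairs discussed above can also be pulled out programmatically rather than read off the matrix by eye (a convenience sketch):

```r
cormat <- cor(numdata)
cormat[upper.tri(cormat, diag = TRUE)] <- NA      # keep each pair only once
pairs <- subset(as.data.frame(as.table(cormat)), !is.na(Freq))
head(pairs[order(-abs(pairs$Freq)), ], 5)         # strongest associations first
```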
transformdata <- mutate(numdata,
                        speechiness = log(speechiness + 1e-9),
                        instrumentalness = log(instrumentalness + 1e-9))
cor(transformdata)
## ArtistGen ArtistDebut popularity duration
## ArtistGen 1.000000000 0.906463428 0.572507933 -0.193551800
## ArtistDebut 0.906463428 1.000000000 0.566633425 -0.222470200
## popularity 0.572507933 0.566633425 1.000000000 -0.146654469
## duration -0.193551800 -0.222470200 -0.146654469 1.000000000
## acousticness 0.013167726 0.048776911 -0.010959230 0.106546635
## danceability -0.035442789 -0.053285013 0.002225069 -0.207410267
## energy 0.063873042 0.044943677 0.029427259 -0.151948909
## instrumentalness -0.195405508 -0.229965543 -0.142102756 -0.049222379
## key -0.014250182 -0.015387239 0.010163051 -0.019359159
## liveness 0.023913390 0.015616327 0.004411808 -0.073202424
## loudness 0.237058982 0.268033589 0.209447173 -0.087723119
## mode -0.013457062 -0.004185209 -0.047089636 0.085317965
## speechiness 0.088699763 0.069743321 0.126964249 -0.221039402
## tempo 0.032818633 0.023702499 0.010452936 -0.008087927
## time_signature 0.004728772 0.005876250 -0.007400160 0.030012350
## valence -0.067554295 -0.071581185 -0.054143720 -0.209173214
## acousticness danceability energy instrumentalness
## ArtistGen 0.01316773 -0.035442789 0.06387304 -0.19540551
## ArtistDebut 0.04877691 -0.053285013 0.04494368 -0.22996554
## popularity -0.01095923 0.002225069 0.02942726 -0.14210276
## duration 0.10654664 -0.207410267 -0.15194891 -0.04922238
## acousticness 1.00000000 -0.308316310 -0.65856905 -0.10973699
## danceability -0.30831631 1.000000000 0.25229933 0.08695894
## energy -0.65856905 0.252299330 1.00000000 0.01252531
## instrumentalness -0.10973699 0.086958941 0.01252531 1.00000000
## key -0.01058317 0.024133395 0.01142881 0.03140073
## liveness -0.08869619 -0.049754947 0.17836243 -0.01996440
## loudness -0.41056995 0.150430522 0.70962088 -0.18782269
## mode 0.16462169 -0.125608880 -0.16613186 -0.04880727
## speechiness -0.24339715 0.122368783 0.33357770 -0.05752435
## tempo -0.08788863 -0.207198178 0.13761768 0.01645483
## time_signature -0.14759122 0.131205123 0.16255464 -0.02863636
## valence -0.34318875 0.520565826 0.48114552 0.02924574
## key liveness loudness mode
## ArtistGen -0.0142501822 0.0239133904 0.237058982 -0.013457062
## ArtistDebut -0.0153872391 0.0156163270 0.268033589 -0.004185209
## popularity 0.0101630513 0.0044118082 0.209447173 -0.047089636
## duration -0.0193591591 -0.0732024241 -0.087723119 0.085317965
## acousticness -0.0105831686 -0.0886961850 -0.410569949 0.164621692
## danceability 0.0241333947 -0.0497549467 0.150430522 -0.125608880
## energy 0.0114288128 0.1783624263 0.709620884 -0.166131860
## instrumentalness 0.0314007266 -0.0199644032 -0.187822693 -0.048807273
## key 1.0000000000 -0.0001758457 0.003934826 -0.187091194
## liveness -0.0001758457 1.0000000000 0.089253003 -0.041983169
## loudness 0.0039348264 0.0892530029 1.000000000 -0.088897818
## mode -0.1870911943 -0.0419831687 -0.088897818 1.000000000
## speechiness 0.0283014090 0.1028917853 0.151848672 -0.141865199
## tempo -0.0071509873 0.0242618772 0.115799160 0.005119411
## time_signature 0.0048245791 0.0116518797 0.151265070 -0.027722047
## valence 0.0188664411 0.0457798082 0.319998708 -0.138146536
## speechiness tempo time_signature valence
## ArtistGen 0.08869976 0.032818633 0.004728772 -0.06755430
## ArtistDebut 0.06974332 0.023702499 0.005876250 -0.07158118
## popularity 0.12696425 0.010452936 -0.007400160 -0.05414372
## duration -0.22103940 -0.008087927 0.030012350 -0.20917321
## acousticness -0.24339715 -0.087888626 -0.147591224 -0.34318875
## danceability 0.12236878 -0.207198178 0.131205123 0.52056583
## energy 0.33357770 0.137617678 0.162554641 0.48114552
## instrumentalness -0.05752435 0.016454835 -0.028636360 0.02924574
## key 0.02830141 -0.007150987 0.004824579 0.01886644
## liveness 0.10289179 0.024261877 0.011651880 0.04577981
## loudness 0.15184867 0.115799160 0.151265070 0.31999871
## mode -0.14186520 0.005119411 -0.027722047 -0.13814654
## speechiness 1.00000000 0.134477583 0.046803510 0.22250167
## tempo 0.13447758 1.000000000 -0.041415907 0.03904957
## time_signature 0.04680351 -0.041415907 1.000000000 0.09944025
## valence 0.22250167 0.039049566 0.099440246 1.00000000
No major changes in correlations. After the transform there are still no strong correlations between instrumentalness or speechiness and the other variables.
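This stability is not surprising: Pearson correlation measures linear association, and while a log transform rescales a skewed variable, it does not change the ordering of its values. A sketch with simulated data (a zero-inflated variable standing in for instrumentalness; names are illustrative, not from the dataset) shows that Spearman correlation, which depends only on ranks, is exactly unchanged under the transform:

```r
set.seed(42)
n <- 1000
instr  <- ifelse(runif(n) < 0.7, 0, rexp(n, rate = 5))  # ~70% exact zeros
energy <- 0.5 - 0.2 * instr + rnorm(n, sd = 0.2)

instr_log <- log(instr + 1e-9)  # same offset trick as above; avoids log(0) = -Inf

# Pearson changes somewhat with the rescaling...
c(raw = cor(instr, energy), logged = cor(instr_log, energy))
# ...but Spearman is invariant under any strictly increasing transform,
# a useful sanity check that only the scale, not the ordering, changed:
cor(instr, energy, method = "spearman") == cor(instr_log, energy, method = "spearman")
```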